Imaged Document Text Retrieval Without OCR

Authors

Chew Lim Tan

Weihua Huang

Zhaohui Yu

Yi Xu

Abstract

We propose a method for text retrieval from document images without the use of OCR. Documents are segmented into character objects. Two image features, the Vertical Traverse Density (VTD) and the Horizontal Traverse Density (HTD), are extracted from each character object. An n-gram based document vector is constructed for each document from these features. Text similarity between documents is then measured by calculating the dot product of the document vectors. Testing with seven corpora of imaged textual documents in English and Chinese, as well as images from the UW1 database, confirms the validity of the proposed method.

Index Terms: Document image analysis, document vector, text similarity, text
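The pipeline in the abstract (per-character traverse densities, n-gram document vectors, dot-product similarity) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a traverse density is the number of background-to-ink transitions along each scan line, it treats characters as the same only when their feature tuples match exactly (a stand-in for whatever character grouping the paper uses), and all function names and the toy glyphs `BAR` and `RING` are hypothetical.

```python
from collections import Counter
from math import sqrt


def traverse_density(lines):
    """Count background-to-ink (0 -> 1) transitions along each scan line.

    Assumption: this is what "traverse density" measures; the abstract
    does not define it.
    """
    counts = []
    for line in lines:
        prev, crossings = 0, 0
        for pixel in line:
            if pixel == 1 and prev == 0:
                crossings += 1
            prev = pixel
        counts.append(crossings)
    return tuple(counts)


def char_features(glyph):
    """Return (VTD, HTD) for a binary glyph given as a list of pixel rows."""
    columns = list(zip(*glyph))
    vtd = traverse_density(columns)  # scan each column from top to bottom
    htd = traverse_density(glyph)    # scan each row from left to right
    return vtd, htd


def document_vector(glyphs, n=3):
    """Build a unit-length n-gram count vector over character features.

    Simplification: characters are matched by exact feature equality.
    """
    labels = [char_features(g) for g in glyphs]
    grams = Counter(tuple(labels[i:i + n]) for i in range(len(labels) - n + 1))
    norm = sqrt(sum(c * c for c in grams.values()))
    return {g: c / norm for g, c in grams.items()} if norm else {}


def similarity(vec_a, vec_b):
    """Dot product of two sparse document vectors."""
    return sum(w * vec_b.get(g, 0.0) for g, w in vec_a.items())


# Toy 3x3 glyphs (hypothetical data): a vertical bar and a ring.
BAR = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
RING = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]

doc_a = document_vector([BAR, RING, BAR, RING], n=2)
doc_b = document_vector([BAR, BAR, BAR], n=2)
print(similarity(doc_a, doc_a), similarity(doc_a, doc_b))
```

Because the vectors are normalized to unit length, the dot product behaves like a cosine similarity: a document compared with itself scores close to 1, and documents sharing no n-grams score 0.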


Similar resources

Optical Font Recognition from Projection Profiles

• Recognition of logical document structures [1], where knowledge of the font used in a word, line, or text block may be useful for defining its logical label (chapter title, section title or paragraph).
• Document reproduction, where knowledge of the font is necessary in order to reproduce (reprint) the document.
• Document indexing and information retrieval, where word indexes are generally p...


Document Image Dewarping Based on Text Line Detection and Surface Modeling (RESEARCH NOTE)

Document images produced by a scanner or digital camera usually suffer from geometric and photometric distortions. Both deteriorate the performance of OCR systems. In this paper, we present a novel method to compensate for undesirable geometric distortions, aiming to improve OCR results. Our methodology is based on finding text lines by dynamic local connectivity map and then applying a l...


Document Retrieval In OCR-Scanned Text

The use of a document retrieval system (PADRE) for the Fujitsu AP1000 in processing known-item search queries over OCR-scanned documents is reported. Retrieval performance of an initial set of queries is shown to deteriorate significantly over scanned data with a character error rate of 5%. A preprocessor is used to augment queries with terms which can be derived from original terms using charac...


Novel Arabic OCR Degraded Text Retrieval Model

This paper provides a novel model that enhances the effectiveness of Arabic OCR-degraded text retrieval. The model simulates the Arabic OCR recognition mistakes that happen during the recognition process, based on a word-based approach. Then, using the expected OCR errors, the model expands the user search query. The resulting expanded search query produced higher precision and recall in searching Arabic OCRDegr...


RMIT University at TREC 2008: Legal Track

This paper reports on the participation of RMIT University in the 2008 TREC Legal Track Ad Hoc task. OCR errors can corrupt the document view formed by an information retrieval system, and substantially hinder the successful retrieval of relevant documents for user queries. In previous research, the presence of errors in OCR text was observed to lead to unstable and unpredictable retrieval effe...



Journal:

IEEE Trans. Pattern Anal. Mach. Intell.

Volume 24

Publication date: 2002
